Introduction

The aim of this part of the project is to use occurrence data of bivalves from across the Cretaceous–Palaeogene boundary to reconstruct extinction prevalence and potential drivers of this.

Project environment

renv::restore()

Renv, used above, sets up a project environment that stores the packages and versions used to make the project work. The idea of this is to enable you to go back to this after some time and get the project working again, no matter what’s changed or been updated in the meantime. Similarly, send the project to a collaborator and they can set things up to be working quickly and easily without having to search through packages to install. I’ve already initialized this project, to get things going on your computer should a matter of opening R in this filter. Renv will then install itself and any packages needed into its own library.

The following packages are used:

  • CoordinateCleaner: for checking occurrence data quality (Zizka et al. 2019) (requires proj and gdal).
  • countrycode: for cleaning occurrence data (Arel-Bundock, Enevoldsen, and Yetman 2018).
  • jsonlite: reading JSON-structured data (Ooms 2014).
  • rnaturalearth: country outlines for plotting maps (South 2017a).
  • rnaturaleearthdata: for checking occurrence localities (South 2017b).
  • sf: wrangling geometry and map data (Pebesma 2018).
  • tidyverse: data wrangling pipelines (Wickham et al. 2019).
  • viridis: colour blind-friendly plotting (Garnier et al. 2021).
library(ncdf4)
library(tidyverse)
library(brms)
library(bayesplot)
library(CoordinateCleaner)
library(countrycode)
library(GeoRange)
library(jsonlite)
library(rgeos)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(stars)
library(stringdist)
library(viridis)
library(ncdump) # needs to go after tidyverse I think
library(lattice)

options(mc.cores = parallel::detectCores())
rstan::rstan_options(auto_write = TRUE)

Useful background and sites

I’m rather opinionated about code and formatting. I think it’s generally better to be strict about the packages you use and how you write the code. I tend to use tidyverse style and formatting, which has a guide written by Hadley Wickham, who’s created some of the most used R packages. Google have a style guide for writing R code that I recommend following, but as a guide can and should be changed to meet your needs.

I use piping a lot. This means passing the result of one function as the first argument of the next line. R has a native version shown as %>%, but this is rather new (version 4.1.0). tidyverse and package magrittr has had a different version for while that looks like %>% and is probably the better to use, so you might see that online more often.

This guide is written in R Markdown, a text format that combines text and code and can be run easily in R Studio and other environments. I also tend to use ggplot for plotting, so have a look here for the low-down on that package.

Acquiring data

The data come from the Paleobiology Database (PBDB). Occurrences are downloaded though their API.

First we can download the data as a JSON file and save this to disc: this is our master data set. For this, I’ve gathered occurrences of Bivalvia from the Cretaceous and Palaeogene with all of the occurrence, geographical, systematic, and ecological data. This only need be done once (and I’ve included the file in the repository already).

bivalve_pbdb_url <-
  paste0(
    "https://paleobiodb.org/data1.2/", # PBDB API URL
    "occs/", # occurrences
    "list.json?", # download a JSON file
    "base_name=Bivalvia&", # download bivalve occurrences
    "interval=Cretaceous,Paleogene&", # temporal range to include
    "show=full" # include full data
  )
bivalve_pbdb_file <- paste0(Sys.Date(), "-bivalve_pbdb_occurrences.json")
download.file(bivalve_pbdb_url, bivalve_pbdb_file, method = "curl")

The data file can then be loaded into R, which gives a list with length two: the first element ($elapsed_time) gives the original download time and the second element ($records) is a data.frame of the PBDB occurrence data. We keep just this second part. To save space, I’ve downloaded a version and stored this as a compressed tar file that needs to be expanded first with untar.

bivalve_pbdb_file <- "2022-03-18-bivalve_pbdb_occurrences"
untar(paste0(bivalve_pbdb_file, ".tgz"))
bivalve_pbdb_data <-
  jsonlite::fromJSON(paste0(bivalve_pbdb_file, ".json"))$records %>%
  as_tibble() %>%
  mutate(across(c(rnk, lng, lat), as.numeric))

We won’t need data from all the columns, but the important few are:

A full list of the fields is at the bottom of this page.

Querying data

It’s a good check to have a quick look at the data to check its size and what different values are contained within. Most of the PBDB data are categorical or strings, which makes it a little more difficult (you can’t plot a bar chart for instance, except you can). A few things to check:

bivalve_pbdb_data %>%
  filter(rnk == 3) %>%
  group_by(tna) %>%
  summarize(n())
## # A tibble: 3,935 × 2
##    tna                                  `n()`
##    <chr>                                <int>
##  1 Abra (Abra) petropolitana                6
##  2 Abra (Syndosmya) cygnea                  1
##  3 Abra (Syndosmya) splendens               4
##  4 Abra deshayesi                           2
##  5 Abra nitens                             40
##  6 Abra recluzii                            1
##  7 Abra suessoniensis                       2
##  8 Acanthocardia (Acanthocardia) reedi      2
##  9 Acanthocardia (Schedocardia) tuomeyi    31
## 10 Acanthocardia becksii                    2
## # … with 3,925 more rows

Or a more fun thing to plot the occurrences from different families.

family_occurrences <-
  bivalve_pbdb_data %>%
    filter(rnk == c(3, 4, 5)) %>%
    group_by(fml) %>%
    summarize(family_occurrences = n())
family_occurrences %>%
  ggplot(
    aes(x = reorder(fml, desc(family_occurrences)), y = family_occurrences)
  ) +
    geom_col()

family_occurrences %>%
  slice_max(family_occurrences, n = 10)
## # A tibble: 12 × 2
##    fml          family_occurrences
##    <chr>                     <int>
##  1 Veneridae                  1593
##  2 Inoceramidae               1193
##  3 Ostreidae                  1153
##  4 Gryphaeidae                1122
##  5 Corbulidae                  991
##  6 Pectinidae                  972
##  7 Cardiidae                   971
##  8 Carditidae                  859
##  9 Lucinidae                   777
## 10 Nuculanidae                 774
## 11 Radiolitidae                774
## 12 Tellinidae                  774

Data quality

Now we can do a more thorough look through the data to check that it’s correct, or at least in the form we expect it to be. Fortunately there is a handy package, CoordinateCleaner, that can help with this. It checks for coordinates that are present in the middle of countries or at (0, 0) — implying they have imprecise, or generic location data — or those occurrences that are spatiotemporal outliers — indicating they may not be correct identifications. Q guide through these checks is here.

Spatial checks

This first step checks that the coordinates are included as numbers and that there aren’t any potential errors — such as matching latitude and longitude values. It’s recommended to also check which record, if any, are flagged by adding value = "flagged" to all commands.

occ_checking <-
  bivalve_pbdb_data %>%
  cc_val(lat = "lat", lon = "lng") %>%
  cc_equ(lat = "lat", lon = "lng")
## Testing equal lat/lon
## Testing coordinate validity
## Removed 0 records.
## Removed 0 records.

Next we check for the location proximity to country centres, indicating imprecision.

occ_precision <-
  occ_checking %>%
  cc_cen(lat = "lat", lon = "lng", value = "flagged")
## Testing country centroids
## Flagged 203 records.
occ_precision_fl <-
  bivalve_pbdb_data %>%
  filter(!occ_precision)

Then we check whether occurrences are located within the country they are meant to come from. This requires using ISO 3-letter country codes, while the PBDB uses 2-letter codes, so some conversion is needed; it also uses UK while the country codes is GB and GBR.

# Add record for UK -> GBR, AA -> ATA [Antarctica]
cs_ma <- c("GBR", "ATA")
names(cs_ma) <- c("UK", "AA")
occ_checking <-
  occ_checking %>%
  mutate(
    cc_iso3 = countrycode(
        cc2,
        origin = "iso2c",
        destination = "iso3c",
        custom_match = cs_ma
      )
  )

occ_country <-
  occ_checking %>%
  cc_coun(lat = "lat", lon = "lng", iso3 = "cc_iso3", value = "flagged")
## Testing country identity
## Flagged 6713 records.
occ_country_fl <-
  bivalve_pbdb_data %>%
  filter(!occ_country)

Fourth, is to check against host institutions on the Global Biodiversity Interchange Facility, in case the locations are listed as that rather than the in-the-field locality.

occ_inst <-
  occ_checking %>%
  cc_inst(lat = "lat", lon = "lng", value = "flagged")
## Testing biodiversity institutions
## Flagged 0 records.
occ_gbif <-
  occ_checking %>%
  cc_gbif(lat = "lat", lon = "lng", value = "flagged")
## Testing GBIF headquarters, flagging records around Copenhagen
## Flagged 5 records.

And finally, check for records at (0, 0).

occ_zero_zero <-
  occ_checking %>%
  cc_zero(lat = "lat", lon = "lng", value = "flagged")
## Testing zero coordinates
## Flagged 0 records.

Temporal checks

Alongside check that occurrences are not at potentially imprecise/incorrect localities, also see whether any are erroneous separated in time is useful. First removing occurrences without ages.

occ_empty_ages <-
  occ_checking %>%
    filter(
      !is.na(lag),
      !is.na(eag)
    ) %>%
  cf_equal(min_age = "lag", max_age = "eag", value = "flagged")
## Testing age validity
## Flagged 0 records.

Next, check on the precision of dating as the very imprecisely dated occurrences can be readily excluded.

bivalve_pbdb_data %>%
  mutate(dating_precision = eag - lag) %>%
  ggplot(aes(dating_precision)) +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of Bivalvia occurrence dating ranges from the Paleobiology Database.

Histogram of Bivalvia occurrence dating ranges from the Paleobiology Database.

We’re mostly doing okay, with precision less than 20 Ma, but some extend all the way up to 140 Ma dates; we don’t want those!CoordinateCleaner also provides a way to clean based on range comparisons between taxa and comparing similarities — excluding occurrences based on the range of differences of dating precision rather than just setting an arbitrary cutoff. But an absolute value can sometimes be useful: in this case using 35 Ma as the length of the Late Cretaceous. If occurrences can’t certainly be dated within that time frame than we can readily exclude them. As a check of how removing these occurrences affects the spread of ranges, we plot the histogram again.

occ_ranges_total <-
  occ_checking %>%
  cf_range(taxon = "", min_age = "lag", max_age = "eag", value = "flagged")
## Testing temporal range outliers on dataset level
## Flagged 979 records.
occ_ranges_taxon <-
  occ_checking %>%
  cf_range(taxon = "tna", min_age = "lag", max_age = "eag", value = "flagged")
## Testing temporal range outliers on taxon level
## Flagged 2122 records.
occ_ranges_absolute <-
  occ_checking %>%
  cf_range(
    taxon = "tna",
    min_age = "lag", max_age = "eag",
    method = "time", max_range = 35, value = "flagged"
  )
## Testing temporal range outliers on taxon level
## Flagged 150 records.
bivalve_pbdb_data %>%
  filter(occ_ranges_total & occ_ranges_taxon & occ_ranges_absolute) %>%
  mutate(dating_precision = eag - lag) %>%
  ggplot(aes(dating_precision)) +
    geom_histogram(bins = 50)
Histogram of cleaned occurrence dating ranges.

Histogram of cleaned occurrence dating ranges.

Most of the occurrences in this data set can be dated within 10 Ma, so that’s good.

Now for checking outliers in time and space — those that are much separated from their main cluster and so may be a misidentification.

# this function exhausted memory, so I don't run it
# occ_outliers_total <-
#   occ_checking %>%
#   cf_outl(
      # taxon = "",
      # lat = "lat", lon = "lng",
      # min_age = "lag", max_age = "eag",
      # value = "flagged"
    # )

occ_outliers_taxon <-
  occ_checking %>%
  cf_outl(
    taxon = "tna",
    lat = "lat", lon = "lng",
    min_age = "lag", max_age = "eag",
    value = "flagged"
  )
## Testing spatio-temporal outliers on taxon level
## Flagged 873 records.

Excluding questionable data

The above functions let you go through and check the removed occurrences for the reasons and to make sure they should be removed. This code below does all the cleaning automatically using the helper function clean_fossils.

cs_ma <- c(
  "UK" = "GBR",
  "AA" = "ATA"
)
bivalve_pbdb_data <-
  bivalve_pbdb_data %>%
  mutate(
    cc_iso3 = countrycode(cc2,
      origin = "iso2c", destination = "iso3c", custom_match = cs_ma
    )
  )

bivalve_cleaned <-
  bivalve_pbdb_data %>%
  clean_fossils(
    taxon = "tna",
    min_age = "lag", max_age = "eag",
    lon = "lng", lat = "lat",
    countries = "cc_iso3", value = "clean"
  ) %>%
  filter(str_detect(jev, "marine|shelf|oceanic"))
## Testing coordinate validity
## Flagged 0 records.
## Testing equal lat/lon
## Flagged 0 records.
## Testing zero coordinates
## Flagged 0 records.
## Testing country centroids
## Flagged 120 records.
## Testing spatio-temporal outliers on taxon level
## Warning in cf_outl(x = x, lon = lon, lat = lat, min_age = min_age, max_age
## = max_age, : decimallatitude not found. Using lat instead.
## Warning in cf_outl(x = x, lon = lon, lat = lat, min_age = min_age, max_age
## = max_age, : decimallongitude not found. Using lng instead.
## Flagged 884 records.
## Warning in cf_range(x = test, taxon = "", min_age = min_age, max_age =
## max_age, : decimallatitude not found. Using lat instead.
## Warning in cf_range(x = test, taxon = "", min_age = min_age, max_age =
## max_age, : decimallongitude not found. Using lng instead.
## Testing temporal range outliers on dataset level
## Flagged 979 records.
## Warning in cf_range(ran.test, taxon = taxon, min_age = min_age, max_age =
## max_age, : decimallatitude not found. Using lat instead.
## Warning in cf_range(ran.test, taxon = taxon, min_age = min_age, max_age =
## max_age, : decimallongitude not found. Using lng instead.
## Testing temporal range outliers on taxon level
## Flagged 2097 records.
## Testing age validity
## Flagged 0 records.
## Testing GBIF headquarters, flagging records around Copenhagen
## Flagged 5 records.
## Testing biodiversity institutions
## Flagged 0 records.
## Flagged 4002 of 74262 records, EQ = 0.05

That last line filters only those occurrences from marine/shelf/oceanic environments, based on the lithology. Have a look through the column and see whether you want to include any other environments; separate them with the vertical bar — |, meaning ‘or’ in regular expressions. The clean_fossils function outputs useful messages to show what stage it’s at; I get some warnings about decimallatitude and decimallongitude, but I don’t think that’s anything to worry about.

Checking taxonomy

A final set of things to check is for names that might have misspellings or missing taxonomy data.

Remove subgenus names

One of the features of many names is including a subgenus name in parentheses. While taxonomically important, this can be problematic as some taxa will include the subgenus, while others may not depending on the taxonomy. We thus remove the subgenus name. At this point, it’s also useful to convert the strings to lowercase to remove any errors from capitalization. Searching and modifying strings in R uses a system ‘regular expressions’: a series of indicators to select certain groups of characters. In the example below1, we’re looking for a set of lowercase letters1, surrounded by parentheses and followed by a space. Have a look at this cheat sheet for stringr (there are cheat sheets for some other packages to look at here: https://www.rstudio.com/resources/cheatsheets/).

bivalve_cleaned <-
  bivalve_cleaned %>%
  mutate(tna_trimmed = str_remove_all(str_to_lower(tna), "\\(([:lower:]+)\\)\\s*"))

Check spelling differences

Next, we check for spelling differences. Here, we use the distance between strings, in essence the number of characters different, to check whether there are misspelled names. Package stringdist has the function stringdistmatrix that calculates the pairwise distances within a string vector, thus showing the most similarly-spelled taxon names. Using an arbitrary distance cutoff — in this case 4, but string_cutoff can be changed — will show the most similar strings.

str_dist_matrix <-
  bivalve_cleaned %>%
  filter(rnk == 3) %>%
  pull(tna_trimmed) %>%
  unique() %>%
  stringdistmatrix(useNames = "strings", method = "lcs") %>%
  as.matrix()

string_cutoff <- 4

str_indices <-
  tibble(
    pos_n       = which(str_dist_matrix <= string_cutoff & str_dist_matrix > 0),
    row_n       = pos_n %/% dim(str_dist_matrix)[2],
    row_n_fixed = replace(row_n, row_n == 0, dim(str_dist_matrix)[2]) + 1, # need to add 1 to offset mod(1583) = 0
    col_n       = pos_n %%  dim(str_dist_matrix)[1],
    col_n_fixed = replace(col_n, col_n == 0, dim(str_dist_matrix)[1]),
    row         = rownames(str_dist_matrix)[row_n_fixed],
    col         = colnames(str_dist_matrix)[col_n_fixed],
    distance    = str_dist_matrix[pos_n]
  ) %>%
  arrange(distance)
write_csv(str_indices, "./output/similar_spellings.csv")

At this point, we check the names against accepted names in WoRMS, LifeWatch, GBIF, or Treatise on Invertebrate Paleontology. Then, we use a conversion vector to correct these in the data: the incorrect version is before the equals and the corrected version after. Note this output table shows differences both ways (row and column), so each minor difference is shown twice with the names switched.

name_corrections <-
  c(
    "syncyclonema haggi" = "syncyclonema haeggi",
    "incorrect" = "correct"
  )

bivalve_cleaned <-
  bivalve_cleaned %>%
  mutate(tna_trimmed = str_replace_all(tna_trimmed, name_corrections))

We now use tna_trimmed as the main taxon name-column.

Check family name validity

We can check what families are in included in the data quite easily but pulling this column out and finding unique records.

bivalve_cleaned %>%
  pull(fml) %>%
  unique()
##  [1] "Nuculidae"           "Nuculanidae"         "Pteriidae"          
##  [4] "Pectinidae"          "Yoldiidae"           "Glycymerididae"     
##  [7] "Pinnidae"            "Limopsidae"          "Arcidae"            
## [10] "Mytilidae"           "Entoliidae"          "Anomiidae"          
## [13] "Limidae"             "Gryphaeidae"         "Ostreidae"          
## [16] "Flemingostreidae"    "Solemyidae"          "Antillocaprinidae"  
## [19] "Bakevelliidae"       "Plicatulidae"        "Neitheidae"         
## [22] "Pulvinitinae"        "Dimyidae"            "Monopleuridae"      
## [25] "Spondylidae"         "Malleidae"           "Propeamussiidae"    
## [28] "NO_FAMILY_SPECIFIED" "Radiolitidae"        "Requieniidae"       
## [31] "Hippuritidae"        "Terquemiidae"        "Trechmannellidae"   
## [34] "Buchiidae"           "Caprinidae"          "Plagioptychidae"    
## [37] "Oxytomidae"          "Caprinuloideidae"    "Heteropectinidae"   
## [40] "Caprinulidae"        "Epidiceratidae"      "Polyconitidae"      
## [43] "Ichthyosarcolitidae" NA                    "Philobryidae"       
## [46] "Manzanellidae"       "Palaeolophidae"      "Diceratidae"        
## [49] "Cardiidae"           "Cassianellidae"      "Posidoniidae"       
## [52] "Pergamidiidae"       "Chondrodontidae"     "Aviculopectinidae"  
## [55] "Pseudomonotidae"     "Caprotinidae"

We see that there are two errant values: “NO_FAMILY_SPECIFIED” and NA.

errant_taxonomy <-
  bivalve_cleaned %>%
  filter(fml == "NO_FAMILY_SPECIFIED" | is.na(fml)) %>%
  select(tna_trimmed, fml, rnk)
print(errant_taxonomy, n = Inf)
## # A tibble: 387 × 3
##     tna_trimmed                  fml                   rnk
##     <chr>                        <chr>               <dbl>
##   1 "entolium"                   NO_FAMILY_SPECIFIED     5
##   2 "entolium"                   NO_FAMILY_SPECIFIED     5
##   3 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   4 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   5 "entolium"                   NO_FAMILY_SPECIFIED     5
##   6 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   7 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   8 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   9 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  10 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  11 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  12 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  13 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  14 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  15 "entolium"                   NO_FAMILY_SPECIFIED     5
##  16 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  17 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  18 "entolium"                   NO_FAMILY_SPECIFIED     5
##  19 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  20 "entolium"                   NO_FAMILY_SPECIFIED     5
##  21 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  22 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  23 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  24 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  25 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  26 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  27 "balabania"                  NO_FAMILY_SPECIFIED     5
##  28 "balabania"                  NO_FAMILY_SPECIFIED     5
##  29 "balabania"                  NO_FAMILY_SPECIFIED     5
##  30 "balabania"                  NO_FAMILY_SPECIFIED     5
##  31 "dechaseauxia"               NO_FAMILY_SPECIFIED     5
##  32 "hardaghia"                  NO_FAMILY_SPECIFIED     5
##  33 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  34 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  35 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  36 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  37 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  38 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  39 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  40 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  41 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  42 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  43 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  44 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  45 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  46 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  47 "entolium"                   NO_FAMILY_SPECIFIED     5
##  48 "entolium"                   NO_FAMILY_SPECIFIED     5
##  49 "entolium"                   NO_FAMILY_SPECIFIED     5
##  50 "entolium"                   NO_FAMILY_SPECIFIED     5
##  51 "entolium"                   NO_FAMILY_SPECIFIED     5
##  52 "entolium"                   NO_FAMILY_SPECIFIED     5
##  53 "entolium"                   NO_FAMILY_SPECIFIED     5
##  54 "entolium"                   NO_FAMILY_SPECIFIED     5
##  55 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  56 "entolium"                   NO_FAMILY_SPECIFIED     5
##  57 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  58 "entolium"                   NO_FAMILY_SPECIFIED     5
##  59 "entolium"                   NO_FAMILY_SPECIFIED     5
##  60 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  61 "entolium"                   NO_FAMILY_SPECIFIED     5
##  62 "entolium"                   NO_FAMILY_SPECIFIED     5
##  63 "entolium"                   NO_FAMILY_SPECIFIED     5
##  64 "entolium"                   NO_FAMILY_SPECIFIED     5
##  65 "entolium"                   NO_FAMILY_SPECIFIED     5
##  66 "entolium"                   NO_FAMILY_SPECIFIED     5
##  67 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  68 "entolium"                   NO_FAMILY_SPECIFIED     5
##  69 "entolium"                   NO_FAMILY_SPECIFIED     5
##  70 "entolium"                   NO_FAMILY_SPECIFIED     5
##  71 "entolium"                   NO_FAMILY_SPECIFIED     5
##  72 "entolium"                   NO_FAMILY_SPECIFIED     5
##  73 "entolium"                   NO_FAMILY_SPECIFIED     5
##  74 "entolium"                   NO_FAMILY_SPECIFIED     5
##  75 "entolium"                   NO_FAMILY_SPECIFIED     5
##  76 "entolium"                   NO_FAMILY_SPECIFIED     5
##  77 "entolium"                   NO_FAMILY_SPECIFIED     5
##  78 "entolium"                   NO_FAMILY_SPECIFIED     5
##  79 "entolium"                   NO_FAMILY_SPECIFIED     5
##  80 "entolium"                   NO_FAMILY_SPECIFIED     5
##  81 "entolium"                   NO_FAMILY_SPECIFIED     5
##  82 "entolium"                   NO_FAMILY_SPECIFIED     5
##  83 "entolium"                   NO_FAMILY_SPECIFIED     5
##  84 "entolium"                   NO_FAMILY_SPECIFIED     5
##  85 "entolium"                   NO_FAMILY_SPECIFIED     5
##  86 "entolium"                   NO_FAMILY_SPECIFIED     5
##  87 "entolium"                   NO_FAMILY_SPECIFIED     5
##  88 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  89 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  90 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  91 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  92 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  93 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  94 "entolium"                   NO_FAMILY_SPECIFIED     5
##  95 "entolium"                   NO_FAMILY_SPECIFIED     5
##  96 "ostreidina"                 <NA>                   12
##  97 "ostreidina"                 <NA>                   12
##  98 "megadiceras"                NO_FAMILY_SPECIFIED     5
##  99 "megadiceras"                NO_FAMILY_SPECIFIED     5
## 100 "hippuritacea"               <NA>                   10
## 101 "hippuritacea"               <NA>                   10
## 102 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 103 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 104 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 105 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 106 "hippuritida"                <NA>                   13
## 107 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 108 "entolium"                   NO_FAMILY_SPECIFIED     5
## 109 "entolium"                   NO_FAMILY_SPECIFIED     5
## 110 "entolium"                   NO_FAMILY_SPECIFIED     5
## 111 "entolium"                   NO_FAMILY_SPECIFIED     5
## 112 "entolium"                   NO_FAMILY_SPECIFIED     5
## 113 "entolium"                   NO_FAMILY_SPECIFIED     5
## 114 "entolium"                   NO_FAMILY_SPECIFIED     5
## 115 "entolium"                   NO_FAMILY_SPECIFIED     5
## 116 "entolium"                   NO_FAMILY_SPECIFIED     5
## 117 "entolium"                   NO_FAMILY_SPECIFIED     5
## 118 "entolium"                   NO_FAMILY_SPECIFIED     5
## 119 "entolium"                   NO_FAMILY_SPECIFIED     5
## 120 "entolium"                   NO_FAMILY_SPECIFIED     5
## 121 "entolium"                   NO_FAMILY_SPECIFIED     5
## 122 "entolium"                   NO_FAMILY_SPECIFIED     5
## 123 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 124 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 125 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 126 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 127 "hippuritida"                <NA>                   13
## 128 "entolium"                   NO_FAMILY_SPECIFIED     5
## 129 "ostreacea"                  <NA>                   10
## 130 "entolium"                   NO_FAMILY_SPECIFIED     5
## 131 "entolium"                   NO_FAMILY_SPECIFIED     5
## 132 "entolium"                   NO_FAMILY_SPECIFIED     5
## 133 "entolium"                   NO_FAMILY_SPECIFIED     5
## 134 "entolium"                   NO_FAMILY_SPECIFIED     5
## 135 "hippuritacea"               <NA>                   10
## 136 "entolium"                   NO_FAMILY_SPECIFIED     5
## 137 "entolium"                   NO_FAMILY_SPECIFIED     5
## 138 "entolium"                   NO_FAMILY_SPECIFIED     5
## 139 "entolium"                   NO_FAMILY_SPECIFIED     5
## 140 "entolium"                   NO_FAMILY_SPECIFIED     5
## 141 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 142 "rhedensia"                  NO_FAMILY_SPECIFIED     5
## 143 "entolium"                   NO_FAMILY_SPECIFIED     5
## 144 "entolium"                   NO_FAMILY_SPECIFIED     5
## 145 "entolium"                   NO_FAMILY_SPECIFIED     5
## 146 "entolium"                   NO_FAMILY_SPECIFIED     5
## 147 "entolium"                   NO_FAMILY_SPECIFIED     5
## 148 "entolium"                   NO_FAMILY_SPECIFIED     5
## 149 "entolium"                   NO_FAMILY_SPECIFIED     5
## 150 "entolium"                   NO_FAMILY_SPECIFIED     5
## 151 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 152 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 153 "entolium"                   NO_FAMILY_SPECIFIED     5
## 154 "entolium"                   NO_FAMILY_SPECIFIED     5
## 155 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 156 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 157 "entolium demissus"          NO_FAMILY_SPECIFIED     3
## 158 "entolium demissus"          NO_FAMILY_SPECIFIED     3
## 159 "hippuritacea"               <NA>                   10
## 160 "entolium "                  NO_FAMILY_SPECIFIED     4
## 161 "ostreida"                   <NA>                   13
## 162 "ostreida"                   <NA>                   13
## 163 "ostreida"                   <NA>                   13
## 164 "ostreida"                   <NA>                   13
## 165 "entolium"                   NO_FAMILY_SPECIFIED     5
## 166 "entolium"                   NO_FAMILY_SPECIFIED     5
## 167 "entolium"                   NO_FAMILY_SPECIFIED     5
## 168 "entolium"                   NO_FAMILY_SPECIFIED     5
## 169 "entolium"                   NO_FAMILY_SPECIFIED     5
## 170 "entolium"                   NO_FAMILY_SPECIFIED     5
## 171 "entolium"                   NO_FAMILY_SPECIFIED     5
## 172 "entolium"                   NO_FAMILY_SPECIFIED     5
## 173 "entolium"                   NO_FAMILY_SPECIFIED     5
## 174 "entolium"                   NO_FAMILY_SPECIFIED     5
## 175 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 176 "tetravaccinites"            NO_FAMILY_SPECIFIED     5
## 177 "pseudopironaea"             NO_FAMILY_SPECIFIED     5
## 178 "entolium"                   NO_FAMILY_SPECIFIED     5
## 179 "entolium"                   NO_FAMILY_SPECIFIED     5
## 180 "hippuritida"                <NA>                   13
## 181 "hippuritida"                <NA>                   13
## 182 "hippuritida"                <NA>                   13
## 183 "hippuritida"                <NA>                   13
## 184 "entolium"                   NO_FAMILY_SPECIFIED     5
## 185 "entolium"                   NO_FAMILY_SPECIFIED     5
## 186 "hippuritida"                <NA>                   13
## 187 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 188 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 189 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 190 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 191 "hippuritida"                <NA>                   13
## 192 "pectinoidea"                <NA>                   10
## 193 "hippuritida"                <NA>                   13
## 194 "hippuritida"                <NA>                   13
## 195 "hippuritida"                <NA>                   13
## 196 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 197 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 198 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 199 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 200 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 201 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 202 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 203 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 204 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 205 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 206 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 207 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 208 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 209 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 210 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 211 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 212 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 213 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 214 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 215 "ostreoidea"                 <NA>                   10
## 216 "ostreoidea"                 <NA>                   10
## 217 "ostreoidea"                 <NA>                   10
## 218 "ostreoidea"                 <NA>                   10
## 219 "ostreoidea"                 <NA>                   10
## 220 "ostreoidea"                 <NA>                   10
## 221 "ostreoidea"                 <NA>                   10
## 222 "ostreoidea"                 <NA>                   10
## 223 "ostreoidea"                 <NA>                   10
## 224 "ostreoidea"                 <NA>                   10
## 225 "ostreoidea"                 <NA>                   10
## 226 "ostreoidea"                 <NA>                   10
## 227 "ostreoidea"                 <NA>                   10
## 228 "ostreoidea"                 <NA>                   10
## 229 "ostreoidea"                 <NA>                   10
## 230 "ostreoidea"                 <NA>                   10
## 231 "ostreoidea"                 <NA>                   10
## 232 "ostreoidea"                 <NA>                   10
## 233 "ostreoidea"                 <NA>                   10
## 234 "hatayia"                    NO_FAMILY_SPECIFIED     5
## 235 "branislavia"                NO_FAMILY_SPECIFIED     5
## 236 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 237 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 238 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 239 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 240 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 241 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 242 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 243 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 244 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 245 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 246 "entolium"                   NO_FAMILY_SPECIFIED     5
## 247 "entolium"                   NO_FAMILY_SPECIFIED     5
## 248 "entolium"                   NO_FAMILY_SPECIFIED     5
## 249 "entolium"                   NO_FAMILY_SPECIFIED     5
## 250 "entolium"                   NO_FAMILY_SPECIFIED     5
## 251 "entolium"                   NO_FAMILY_SPECIFIED     5
## 252 "entolium"                   NO_FAMILY_SPECIFIED     5
## 253 "entolium"                   NO_FAMILY_SPECIFIED     5
## 254 "entolium"                   NO_FAMILY_SPECIFIED     5
## 255 "entolium"                   NO_FAMILY_SPECIFIED     5
## 256 "entolium"                   NO_FAMILY_SPECIFIED     5
## 257 "entolium"                   NO_FAMILY_SPECIFIED     5
## 258 "entolium"                   NO_FAMILY_SPECIFIED     5
## 259 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 260 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 261 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 262 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 263 "entolium"                   NO_FAMILY_SPECIFIED     5
## 264 "entolium"                   NO_FAMILY_SPECIFIED     5
## 265 "entolium"                   NO_FAMILY_SPECIFIED     5
## 266 "entolium"                   NO_FAMILY_SPECIFIED     5
## 267 "entolium"                   NO_FAMILY_SPECIFIED     5
## 268 "entolium"                   NO_FAMILY_SPECIFIED     5
## 269 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 270 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 271 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 272 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 273 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 274 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 275 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 276 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 277 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 278 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 279 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 280 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 281 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 282 "entolium"                   NO_FAMILY_SPECIFIED     5
## 283 "entolium"                   NO_FAMILY_SPECIFIED     5
## 284 "hippuritida"                <NA>                   13
## 285 "hippuritida"                <NA>                   13
## 286 "entolium irenense"          NO_FAMILY_SPECIFIED     3
## 287 "entolium irenense"          NO_FAMILY_SPECIFIED     3
## 288 "hippuritida"                <NA>                   13
## 289 "hippuritacea"               <NA>                   10
## 290 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 291 "ostreida"                   <NA>                   13
## 292 "ostreida"                   <NA>                   13
## 293 "ostreida"                   <NA>                   13
## 294 "ostreida"                   <NA>                   13
## 295 "ostreida"                   <NA>                   13
## 296 "ostreida"                   <NA>                   13
## 297 "ostreida"                   <NA>                   13
## 298 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 299 "hatayia"                    NO_FAMILY_SPECIFIED     5
## 300 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 301 "branislavia"                NO_FAMILY_SPECIFIED     5
## 302 "ostreacea"                  <NA>                   10
## 303 "entolium"                   NO_FAMILY_SPECIFIED     5
## 304 "entolium"                   NO_FAMILY_SPECIFIED     5
## 305 "entolium"                   NO_FAMILY_SPECIFIED     5
## 306 "entolium"                   NO_FAMILY_SPECIFIED     5
## 307 "ostreida"                   <NA>                   13
## 308 "ostreida"                   <NA>                   13
## 309 "entolium"                   NO_FAMILY_SPECIFIED     5
## 310 "cryptaulia"                 NO_FAMILY_SPECIFIED     5
## 311 "ostreida"                   <NA>                   13
## 312 "branislavia"                NO_FAMILY_SPECIFIED     5
## 313 "entolium"                   NO_FAMILY_SPECIFIED     5
## 314 "miseia"                     NO_FAMILY_SPECIFIED     5
## 315 "miseia"                     NO_FAMILY_SPECIFIED     5
## 316 "miseia"                     NO_FAMILY_SPECIFIED     5
## 317 "miseia"                     NO_FAMILY_SPECIFIED     5
## 318 "miseia"                     NO_FAMILY_SPECIFIED     5
## 319 "miseia"                     NO_FAMILY_SPECIFIED     5
## 320 "miseia"                     NO_FAMILY_SPECIFIED     5
## 321 "miseia"                     NO_FAMILY_SPECIFIED     5
## 322 "miseia"                     NO_FAMILY_SPECIFIED     5
## 323 "miseia"                     NO_FAMILY_SPECIFIED     5
## 324 "miseia"                     NO_FAMILY_SPECIFIED     5
## 325 "miseia"                     NO_FAMILY_SPECIFIED     5
## 326 "branislavia"                NO_FAMILY_SPECIFIED     5
## 327 "miseia"                     NO_FAMILY_SPECIFIED     5
## 328 "miseia"                     NO_FAMILY_SPECIFIED     5
## 329 "branislavia"                NO_FAMILY_SPECIFIED     5
## 330 "miseia"                     NO_FAMILY_SPECIFIED     5
## 331 "miseia"                     NO_FAMILY_SPECIFIED     5
## 332 "miseia"                     NO_FAMILY_SPECIFIED     5
## 333 "branislavia"                NO_FAMILY_SPECIFIED     5
## 334 "hippuritida"                <NA>                   13
## 335 "hippuritida"                <NA>                   13
## 336 "hippuritida"                <NA>                   13
## 337 "hippuritida"                <NA>                   13
## 338 "hippuritida"                <NA>                   13
## 339 "hippuritida"                <NA>                   13
## 340 "hippuritida"                <NA>                   13
## 341 "hippuritida"                <NA>                   13
## 342 "hippuritida"                <NA>                   13
## 343 "hippuritida"                <NA>                   13
## 344 "hippuritida"                <NA>                   13
## 345 "tetravaccinites collignoni" NO_FAMILY_SPECIFIED     3
## 346 "pectinida"                  <NA>                   13
## 347 "pectinida"                  <NA>                   13
## 348 "pectinida"                  <NA>                   13
## 349 "entolium"                   NO_FAMILY_SPECIFIED     5
## 350 "entolium"                   NO_FAMILY_SPECIFIED     5
## 351 "entolium"                   NO_FAMILY_SPECIFIED     5
## 352 "entolium"                   NO_FAMILY_SPECIFIED     5
## 353 "entolium"                   NO_FAMILY_SPECIFIED     5
## 354 "entolium"                   NO_FAMILY_SPECIFIED     5
## 355 "entolium"                   NO_FAMILY_SPECIFIED     5
## 356 "entolium"                   NO_FAMILY_SPECIFIED     5
## 357 "entolium"                   NO_FAMILY_SPECIFIED     5
## 358 "entolium"                   NO_FAMILY_SPECIFIED     5
## 359 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 360 "pectinida"                  <NA>                   13
## 361 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 362 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 363 "entolium"                   NO_FAMILY_SPECIFIED     5
## 364 "entolium"                   NO_FAMILY_SPECIFIED     5
## 365 "entolium"                   NO_FAMILY_SPECIFIED     5
## 366 "entolium"                   NO_FAMILY_SPECIFIED     5
## 367 "entolium"                   NO_FAMILY_SPECIFIED     5
## 368 "entolium"                   NO_FAMILY_SPECIFIED     5
## 369 "entolium"                   NO_FAMILY_SPECIFIED     5
## 370 "hippuritida"                <NA>                   13
## 371 "ostreida"                   <NA>                   13
## 372 "ostreida"                   <NA>                   13
## 373 "entolium"                   NO_FAMILY_SPECIFIED     5
## 374 "entolium"                   NO_FAMILY_SPECIFIED     5
## 375 "entolium"                   NO_FAMILY_SPECIFIED     5
## 376 "entolium"                   NO_FAMILY_SPECIFIED     5
## 377 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 378 "entolium"                   NO_FAMILY_SPECIFIED     5
## 379 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 380 "pectinoidea"                <NA>                   10
## 381 "hippuritida"                <NA>                   13
## 382 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 383 "radiolitidina"              <NA>                   12
## 384 "ostreidina"                 <NA>                   12
## 385 "ostreidina"                 <NA>                   12
## 386 "ostreidina"                 <NA>                   12
## 387 "pectinoidea"                <NA>                   10
errant_taxonomy %>%
  group_by(rnk) %>%
  summarize(n = n())
## # A tibble: 6 × 2
##     rnk     n
##   <dbl> <int>
## 1     3   118
## 2     4     1
## 3     5   186
## 4    10    29
## 5    12     6
## 6    13    47
errant_taxonomy %>% pull(tna_trimmed) %>% unique()
##  [1] "entolium"                   "entolium membranaceum"     
##  [3] "balabania"                  "dechaseauxia"              
##  [5] "hardaghia"                  "entolium orbiculare"       
##  [7] "entolium demissus"          "ostreidina"                
##  [9] "megadiceras"                "hippuritacea"              
## [11] "hippuritida"                "entolium utukokense"       
## [13] "ostreacea"                  "rhedensia"                 
## [15] "praecaprina"                "entolium "                 
## [17] "ostreida"                   "tetravaccinites"           
## [19] "pseudopironaea"             "pectinoidea"               
## [21] "ostreoidea"                 "hatayia"                   
## [23] "branislavia"                "entolium irenense"         
## [25] "cryptaulia"                 "miseia"                    
## [27] "tetravaccinites collignoni" "pectinida"                 
## [29] "radiolitidina"

This finds 387 records that wither have no family or NA. Most are identified to genus level, but a similar number are species. It turns out that those with NA in the family are only identified to a higher taxonomic level, so we can leave them alone.

The code below instead shows the lower taxa without a family specified. These are ripe for inclusion.

errant_families <-
  bivalve_cleaned %>%
  filter(fml == "NO_FAMILY_SPECIFIED") %>%
  select(tna_trimmed, fml, rnk)
print(errant_families, n = Inf)
## # A tibble: 305 × 3
##     tna_trimmed                  fml                   rnk
##     <chr>                        <chr>               <dbl>
##   1 "entolium"                   NO_FAMILY_SPECIFIED     5
##   2 "entolium"                   NO_FAMILY_SPECIFIED     5
##   3 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   4 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   5 "entolium"                   NO_FAMILY_SPECIFIED     5
##   6 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   7 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   8 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##   9 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  10 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  11 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  12 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  13 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  14 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  15 "entolium"                   NO_FAMILY_SPECIFIED     5
##  16 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  17 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  18 "entolium"                   NO_FAMILY_SPECIFIED     5
##  19 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  20 "entolium"                   NO_FAMILY_SPECIFIED     5
##  21 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  22 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  23 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  24 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  25 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  26 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  27 "balabania"                  NO_FAMILY_SPECIFIED     5
##  28 "balabania"                  NO_FAMILY_SPECIFIED     5
##  29 "balabania"                  NO_FAMILY_SPECIFIED     5
##  30 "balabania"                  NO_FAMILY_SPECIFIED     5
##  31 "dechaseauxia"               NO_FAMILY_SPECIFIED     5
##  32 "hardaghia"                  NO_FAMILY_SPECIFIED     5
##  33 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  34 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  35 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  36 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  37 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  38 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  39 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  40 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  41 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  42 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  43 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  44 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  45 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  46 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  47 "entolium"                   NO_FAMILY_SPECIFIED     5
##  48 "entolium"                   NO_FAMILY_SPECIFIED     5
##  49 "entolium"                   NO_FAMILY_SPECIFIED     5
##  50 "entolium"                   NO_FAMILY_SPECIFIED     5
##  51 "entolium"                   NO_FAMILY_SPECIFIED     5
##  52 "entolium"                   NO_FAMILY_SPECIFIED     5
##  53 "entolium"                   NO_FAMILY_SPECIFIED     5
##  54 "entolium"                   NO_FAMILY_SPECIFIED     5
##  55 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  56 "entolium"                   NO_FAMILY_SPECIFIED     5
##  57 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  58 "entolium"                   NO_FAMILY_SPECIFIED     5
##  59 "entolium"                   NO_FAMILY_SPECIFIED     5
##  60 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  61 "entolium"                   NO_FAMILY_SPECIFIED     5
##  62 "entolium"                   NO_FAMILY_SPECIFIED     5
##  63 "entolium"                   NO_FAMILY_SPECIFIED     5
##  64 "entolium"                   NO_FAMILY_SPECIFIED     5
##  65 "entolium"                   NO_FAMILY_SPECIFIED     5
##  66 "entolium"                   NO_FAMILY_SPECIFIED     5
##  67 "entolium demissus"          NO_FAMILY_SPECIFIED     3
##  68 "entolium"                   NO_FAMILY_SPECIFIED     5
##  69 "entolium"                   NO_FAMILY_SPECIFIED     5
##  70 "entolium"                   NO_FAMILY_SPECIFIED     5
##  71 "entolium"                   NO_FAMILY_SPECIFIED     5
##  72 "entolium"                   NO_FAMILY_SPECIFIED     5
##  73 "entolium"                   NO_FAMILY_SPECIFIED     5
##  74 "entolium"                   NO_FAMILY_SPECIFIED     5
##  75 "entolium"                   NO_FAMILY_SPECIFIED     5
##  76 "entolium"                   NO_FAMILY_SPECIFIED     5
##  77 "entolium"                   NO_FAMILY_SPECIFIED     5
##  78 "entolium"                   NO_FAMILY_SPECIFIED     5
##  79 "entolium"                   NO_FAMILY_SPECIFIED     5
##  80 "entolium"                   NO_FAMILY_SPECIFIED     5
##  81 "entolium"                   NO_FAMILY_SPECIFIED     5
##  82 "entolium"                   NO_FAMILY_SPECIFIED     5
##  83 "entolium"                   NO_FAMILY_SPECIFIED     5
##  84 "entolium"                   NO_FAMILY_SPECIFIED     5
##  85 "entolium"                   NO_FAMILY_SPECIFIED     5
##  86 "entolium"                   NO_FAMILY_SPECIFIED     5
##  87 "entolium"                   NO_FAMILY_SPECIFIED     5
##  88 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  89 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  90 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  91 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  92 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  93 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
##  94 "entolium"                   NO_FAMILY_SPECIFIED     5
##  95 "entolium"                   NO_FAMILY_SPECIFIED     5
##  96 "megadiceras"                NO_FAMILY_SPECIFIED     5
##  97 "megadiceras"                NO_FAMILY_SPECIFIED     5
##  98 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
##  99 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 100 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 101 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 102 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 103 "entolium"                   NO_FAMILY_SPECIFIED     5
## 104 "entolium"                   NO_FAMILY_SPECIFIED     5
## 105 "entolium"                   NO_FAMILY_SPECIFIED     5
## 106 "entolium"                   NO_FAMILY_SPECIFIED     5
## 107 "entolium"                   NO_FAMILY_SPECIFIED     5
## 108 "entolium"                   NO_FAMILY_SPECIFIED     5
## 109 "entolium"                   NO_FAMILY_SPECIFIED     5
## 110 "entolium"                   NO_FAMILY_SPECIFIED     5
## 111 "entolium"                   NO_FAMILY_SPECIFIED     5
## 112 "entolium"                   NO_FAMILY_SPECIFIED     5
## 113 "entolium"                   NO_FAMILY_SPECIFIED     5
## 114 "entolium"                   NO_FAMILY_SPECIFIED     5
## 115 "entolium"                   NO_FAMILY_SPECIFIED     5
## 116 "entolium"                   NO_FAMILY_SPECIFIED     5
## 117 "entolium"                   NO_FAMILY_SPECIFIED     5
## 118 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 119 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 120 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 121 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 122 "entolium"                   NO_FAMILY_SPECIFIED     5
## 123 "entolium"                   NO_FAMILY_SPECIFIED     5
## 124 "entolium"                   NO_FAMILY_SPECIFIED     5
## 125 "entolium"                   NO_FAMILY_SPECIFIED     5
## 126 "entolium"                   NO_FAMILY_SPECIFIED     5
## 127 "entolium"                   NO_FAMILY_SPECIFIED     5
## 128 "entolium"                   NO_FAMILY_SPECIFIED     5
## 129 "entolium"                   NO_FAMILY_SPECIFIED     5
## 130 "entolium"                   NO_FAMILY_SPECIFIED     5
## 131 "entolium"                   NO_FAMILY_SPECIFIED     5
## 132 "entolium"                   NO_FAMILY_SPECIFIED     5
## 133 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 134 "rhedensia"                  NO_FAMILY_SPECIFIED     5
## 135 "entolium"                   NO_FAMILY_SPECIFIED     5
## 136 "entolium"                   NO_FAMILY_SPECIFIED     5
## 137 "entolium"                   NO_FAMILY_SPECIFIED     5
## 138 "entolium"                   NO_FAMILY_SPECIFIED     5
## 139 "entolium"                   NO_FAMILY_SPECIFIED     5
## 140 "entolium"                   NO_FAMILY_SPECIFIED     5
## 141 "entolium"                   NO_FAMILY_SPECIFIED     5
## 142 "entolium"                   NO_FAMILY_SPECIFIED     5
## 143 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 144 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 145 "entolium"                   NO_FAMILY_SPECIFIED     5
## 146 "entolium"                   NO_FAMILY_SPECIFIED     5
## 147 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 148 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 149 "entolium demissus"          NO_FAMILY_SPECIFIED     3
## 150 "entolium demissus"          NO_FAMILY_SPECIFIED     3
## 151 "entolium "                  NO_FAMILY_SPECIFIED     4
## 152 "entolium"                   NO_FAMILY_SPECIFIED     5
## 153 "entolium"                   NO_FAMILY_SPECIFIED     5
## 154 "entolium"                   NO_FAMILY_SPECIFIED     5
## 155 "entolium"                   NO_FAMILY_SPECIFIED     5
## 156 "entolium"                   NO_FAMILY_SPECIFIED     5
## 157 "entolium"                   NO_FAMILY_SPECIFIED     5
## 158 "entolium"                   NO_FAMILY_SPECIFIED     5
## 159 "entolium"                   NO_FAMILY_SPECIFIED     5
## 160 "entolium"                   NO_FAMILY_SPECIFIED     5
## 161 "entolium"                   NO_FAMILY_SPECIFIED     5
## 162 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 163 "tetravaccinites"            NO_FAMILY_SPECIFIED     5
## 164 "pseudopironaea"             NO_FAMILY_SPECIFIED     5
## 165 "entolium"                   NO_FAMILY_SPECIFIED     5
## 166 "entolium"                   NO_FAMILY_SPECIFIED     5
## 167 "entolium"                   NO_FAMILY_SPECIFIED     5
## 168 "entolium"                   NO_FAMILY_SPECIFIED     5
## 169 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 170 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 171 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 172 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 173 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 174 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 175 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 176 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 177 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 178 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 179 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 180 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 181 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 182 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 183 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 184 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 185 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 186 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 187 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 188 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 189 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 190 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 191 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 192 "hatayia"                    NO_FAMILY_SPECIFIED     5
## 193 "branislavia"                NO_FAMILY_SPECIFIED     5
## 194 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 195 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 196 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 197 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 198 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 199 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 200 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 201 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 202 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 203 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 204 "entolium"                   NO_FAMILY_SPECIFIED     5
## 205 "entolium"                   NO_FAMILY_SPECIFIED     5
## 206 "entolium"                   NO_FAMILY_SPECIFIED     5
## 207 "entolium"                   NO_FAMILY_SPECIFIED     5
## 208 "entolium"                   NO_FAMILY_SPECIFIED     5
## 209 "entolium"                   NO_FAMILY_SPECIFIED     5
## 210 "entolium"                   NO_FAMILY_SPECIFIED     5
## 211 "entolium"                   NO_FAMILY_SPECIFIED     5
## 212 "entolium"                   NO_FAMILY_SPECIFIED     5
## 213 "entolium"                   NO_FAMILY_SPECIFIED     5
## 214 "entolium"                   NO_FAMILY_SPECIFIED     5
## 215 "entolium"                   NO_FAMILY_SPECIFIED     5
## 216 "entolium"                   NO_FAMILY_SPECIFIED     5
## 217 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 218 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 219 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 220 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 221 "entolium"                   NO_FAMILY_SPECIFIED     5
## 222 "entolium"                   NO_FAMILY_SPECIFIED     5
## 223 "entolium"                   NO_FAMILY_SPECIFIED     5
## 224 "entolium"                   NO_FAMILY_SPECIFIED     5
## 225 "entolium"                   NO_FAMILY_SPECIFIED     5
## 226 "entolium"                   NO_FAMILY_SPECIFIED     5
## 227 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 228 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 229 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 230 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 231 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 232 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 233 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 234 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 235 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 236 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 237 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 238 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 239 "entolium membranaceum"      NO_FAMILY_SPECIFIED     3
## 240 "entolium"                   NO_FAMILY_SPECIFIED     5
## 241 "entolium"                   NO_FAMILY_SPECIFIED     5
## 242 "entolium irenense"          NO_FAMILY_SPECIFIED     3
## 243 "entolium irenense"          NO_FAMILY_SPECIFIED     3
## 244 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 245 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 246 "hatayia"                    NO_FAMILY_SPECIFIED     5
## 247 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 248 "branislavia"                NO_FAMILY_SPECIFIED     5
## 249 "entolium"                   NO_FAMILY_SPECIFIED     5
## 250 "entolium"                   NO_FAMILY_SPECIFIED     5
## 251 "entolium"                   NO_FAMILY_SPECIFIED     5
## 252 "entolium"                   NO_FAMILY_SPECIFIED     5
## 253 "entolium"                   NO_FAMILY_SPECIFIED     5
## 254 "cryptaulia"                 NO_FAMILY_SPECIFIED     5
## 255 "branislavia"                NO_FAMILY_SPECIFIED     5
## 256 "entolium"                   NO_FAMILY_SPECIFIED     5
## 257 "miseia"                     NO_FAMILY_SPECIFIED     5
## 258 "miseia"                     NO_FAMILY_SPECIFIED     5
## 259 "miseia"                     NO_FAMILY_SPECIFIED     5
## 260 "miseia"                     NO_FAMILY_SPECIFIED     5
## 261 "miseia"                     NO_FAMILY_SPECIFIED     5
## 262 "miseia"                     NO_FAMILY_SPECIFIED     5
## 263 "miseia"                     NO_FAMILY_SPECIFIED     5
## 264 "miseia"                     NO_FAMILY_SPECIFIED     5
## 265 "miseia"                     NO_FAMILY_SPECIFIED     5
## 266 "miseia"                     NO_FAMILY_SPECIFIED     5
## 267 "miseia"                     NO_FAMILY_SPECIFIED     5
## 268 "miseia"                     NO_FAMILY_SPECIFIED     5
## 269 "branislavia"                NO_FAMILY_SPECIFIED     5
## 270 "miseia"                     NO_FAMILY_SPECIFIED     5
## 271 "miseia"                     NO_FAMILY_SPECIFIED     5
## 272 "branislavia"                NO_FAMILY_SPECIFIED     5
## 273 "miseia"                     NO_FAMILY_SPECIFIED     5
## 274 "miseia"                     NO_FAMILY_SPECIFIED     5
## 275 "miseia"                     NO_FAMILY_SPECIFIED     5
## 276 "branislavia"                NO_FAMILY_SPECIFIED     5
## 277 "tetravaccinites collignoni" NO_FAMILY_SPECIFIED     3
## 278 "entolium"                   NO_FAMILY_SPECIFIED     5
## 279 "entolium"                   NO_FAMILY_SPECIFIED     5
## 280 "entolium"                   NO_FAMILY_SPECIFIED     5
## 281 "entolium"                   NO_FAMILY_SPECIFIED     5
## 282 "entolium"                   NO_FAMILY_SPECIFIED     5
## 283 "entolium"                   NO_FAMILY_SPECIFIED     5
## 284 "entolium"                   NO_FAMILY_SPECIFIED     5
## 285 "entolium"                   NO_FAMILY_SPECIFIED     5
## 286 "entolium"                   NO_FAMILY_SPECIFIED     5
## 287 "entolium"                   NO_FAMILY_SPECIFIED     5
## 288 "praecaprina"                NO_FAMILY_SPECIFIED     5
## 289 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 290 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 291 "entolium"                   NO_FAMILY_SPECIFIED     5
## 292 "entolium"                   NO_FAMILY_SPECIFIED     5
## 293 "entolium"                   NO_FAMILY_SPECIFIED     5
## 294 "entolium"                   NO_FAMILY_SPECIFIED     5
## 295 "entolium"                   NO_FAMILY_SPECIFIED     5
## 296 "entolium"                   NO_FAMILY_SPECIFIED     5
## 297 "entolium"                   NO_FAMILY_SPECIFIED     5
## 298 "entolium"                   NO_FAMILY_SPECIFIED     5
## 299 "entolium"                   NO_FAMILY_SPECIFIED     5
## 300 "entolium"                   NO_FAMILY_SPECIFIED     5
## 301 "entolium"                   NO_FAMILY_SPECIFIED     5
## 302 "entolium orbiculare"        NO_FAMILY_SPECIFIED     3
## 303 "entolium"                   NO_FAMILY_SPECIFIED     5
## 304 "entolium utukokense"        NO_FAMILY_SPECIFIED     3
## 305 "entolium utukokense"        NO_FAMILY_SPECIFIED     3

We can write this out to a file that can then be uploaded to LifeWatch and use that to compare against its databases.

errant_families %>%
  select(tna_trimmed) %>%
  unique() %>%
  transmute(ScientificName = tna_trimmed) %>%
  write_csv("./output/uncertain_taxa.csv")

This list of uncertain taxa can then be compared against an online data base, such as LifeWatch.

lifewatch_taxon_check <-
  read_csv("./output/19470_uncertain_taxa.csv")
## New names:
## * NOTUSED_scientificname -> NOTUSED_scientificname...6
## * NOTUSED_scientificname -> NOTUSED_scientificname...15
## * NOTUSED_scientificname -> NOTUSED_scientificname...20
## * NOTUSED_scientificname -> NOTUSED_scientificname...26
## * NOTUSED_scientificname -> NOTUSED_scientificname...30
## * ...
## Rows: 20 Columns: 64
## ── Column specification ──────────────────────────────────────────────────
## Delimiter: ","
## chr (31): scientificname, note_fuzzy_col, environment_aphia_WORMS, nam...
## dbl (16): required_fields_check, aphiaid_WORMS, scientificnameid, taxo...
## lgl (17): datasetname_fuzzy_col, taxonomicstatus_fuzzy_col, taxonrank_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(lifewatch_taxon_check)
## # A tibble: 20 × 64
##    scientificname       required_fields… datasetname_fuz… taxonomicstatus…
##    <chr>                           <dbl> <lgl>            <lgl>           
##  1 Entolium                            1 NA               NA              
##  2 Entolium membranace…                1 NA               NA              
##  3 Balabania                           1 NA               NA              
##  4 Dechaseauxia                        1 NA               NA              
##  5 Hardaghia                           1 NA               NA              
##  6 Entolium orbiculare                 1 NA               NA              
##  7 Entolium demissus                   1 NA               NA              
##  8 Megadiceras                         1 NA               NA              
##  9 Entolium utukokense                 1 NA               NA              
## 10 Rhedensia                           1 NA               NA              
## 11 Praecaprina                         1 NA               NA              
## 12 Entolium (Entolium)                 1 NA               NA              
## 13 Tetravaccinites                     1 NA               NA              
## 14 Pseudopironaea                      1 NA               NA              
## 15 Hatayia                             1 NA               NA              
## 16 Branislavia                         1 NA               NA              
## 17 Entolium irenense                   1 NA               NA              
## 18 Cryptaulia                          1 NA               NA              
## 19 Miseia                              1 NA               NA              
## 20 Tetravaccinites col…                1 NA               NA              
## # … with 60 more variables: taxonrank_fuzzy_col <lgl>,
## #   NOTUSED_scientificname...6 <lgl>, scientificnameid_fuzzy_col <lgl>,
## #   isextinct_fuzzy_col <lgl>, match_type_fuzzy_col <lgl>,
## #   match_count_fuzzy_col <lgl>, note_fuzzy_col <chr>,
## #   environment_aphia_WORMS <chr>, name_aphia_WORMS <chr>,
## #   aphiaid_WORMS <dbl>, NOTUSED_scientificname...15 <chr>,
## #   scientificnameid <dbl>, status_aphia_WORMS <chr>, …

Mapping data

We can also do more fun things like plot the occurrences on a map. Useful guides to this are found in three blog posts starting here. Working with mapping data is more involved than simple points, particularly if you want to change the projection away from the standard Mercator. The package sf (short for ‘simple features’, describing how the data is stored) can deal with all of this and then in plotting the projection can be changed. A rule of thumb is make sure to assign the coordinate reference system (CRS) when creating map objects.

Package rnaturalearth has regularly updated maps that should be better to use than in the built-in maps library. Here we download the country outlines in sf format.

world_map <-
  ne_countries(returnclass = "sf")

Next, convert the occurrences locations into an sf object also. I’ve filtered down to occurrences identified to species level and then taken a random sample (‘slice’) of 1000 rows. The final stage is convert to an sf object using the longitude and latitude columns, with the standard WGS84 CRS (indicated with 4326 in this case). You can find this code to use looking at epsg.io, searching for the projection, and using the code given. In the case of WGS84 coordinates (standard latitude and longitude) shown this is 4326.

occurrences_to_plot <-
  bivalve_cleaned %>%
  filter(rnk == 3) %>%
  slice_sample(n = 1000) %>%
  st_as_sf(coords = c("lng", "lat"), crs = 4326, agr = "constant")

Now, we put these together into the plot using the geom_sf function to build up the layers: map outline first, then points over the top. The colour scale is good for colour-blind people, so is a good option to prefer. Here I modify the map projection using coord_sf(crs) into the Robinson projection.

The essence of plotting with geom_sf is to add the different layers one by one. In this case, the first call plots the country outlines then the second line plots the occurrence points over the top; scale_colour_viridis is a good set of plotting colours to use as they are colour blind-friendly; and the final theme options remove the border and the grey background usual in ggplot plots.

ggplot() +
  geom_sf(data = world_map, inherit.aes = FALSE) +
  geom_sf(data = occurrences_to_plot, mapping = aes(colour = fml)) +
  scale_colour_viridis(discrete = TRUE) +
  coord_sf(crs = "+proj=robin") +
  theme_bw() +
  theme(
    legend.position = "none",
    panel.border = element_blank()
  )
Map of Bivalvia localities from the Cretaceous and Palaeogene.

Map of Bivalvia localities from the Cretaceous and Palaeogene.

Palaeogeographical map

We use the palaeogeographical reconstructions from the combined Muller2019-Young2019-Cao2020 (MYC) model provided by GPlates (https://www.gplates.org) (Müller et al. 2019; Young et al. 2019; X. Cao et al. 2022; Torsvik et al. 2019). This provides a nice visualization of the palaeogeography including locations of shallow marine environments, land, mountains, and ice caps. We exported these four layers reconstructed to 68 Ma in the latest Cretaceous and plot them as a base map with modern coastlines indicated. The data files for the reconstructions are provided with GPlates or can be downloaded from Earthbyte.

The different layers (shallow marine, land, mountain, ice cap) are included in four files in the ./data/palaeogeographical_reconstructions directory and read into a list then combined using the functions below. Combining these data sets into one makes it quicker to plot and colour the different layers in the base map.

palaeogeo_file_layers <-
  c("Shallow marine" = "sm", "Landmass" = "lm", "Mountain" = "m", "Ice cap" = "i")

read_geojson <-
  function(prefix, filename = "./data/palaeogeographical_reconstructions/id_402_2_reconstructed_68.00Ma.gmt") {
  # Read a list of GeoJSON files and combine into a single sf collection.
  #
  # Arguments:
  #   prefix: layer prefixes for the data files ("sm", "lm", "m", "i") taken from Cao et al. (2017).
  #   filename: rest of GeoJSON file name.
  #
  # Returns:
  #   A single feature collection with geometries and ID column ("id").
  str_replace(filename, "id", prefix) %>%
    st_read() %>%
    add_column(id = prefix)
}

map_data <-
  palaeogeo_file_layers %>%
    purrr::map(read_geojson) %>%
    do.call(rbind, .) %>%
    mutate(
      id = factor(id, levels = palaeogeo_file_layers, labels = names(palaeogeo_file_layers))
    )
## Reading layer `sm_402_2_reconstructed_68.00Ma' from data source 
##   `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/sm_402_2_reconstructed_68.00Ma.gmt' 
##   using driver `OGR_GMT'
## Simple feature collection with 829 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -83.18105 xmax: 180 ymax: 89.52032
## Geodetic CRS:  WGS 84
## Reading layer `lm_402_2_reconstructed_68.00Ma' from data source 
##   `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/lm_402_2_reconstructed_68.00Ma.gmt' 
##   using driver `OGR_GMT'
## Simple feature collection with 323 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 86.3155
## Geodetic CRS:  WGS 84
## Reading layer `m_402_2_reconstructed_68.00Ma' from data source 
##   `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/m_402_2_reconstructed_68.00Ma.gmt' 
##   using driver `OGR_GMT'
## Simple feature collection with 312 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -89.50942 xmax: 180 ymax: 84.56318
## Geodetic CRS:  WGS 84
## Reading layer `i_402_2_reconstructed_68.00Ma' from data source 
##   `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/i_402_2_reconstructed_68.00Ma.gmt' 
##   using driver `OGR_GMT'
## Simple feature collection with 4 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 4.391382 ymin: -83.96268 xmax: 103.195 ymax: 22.29246
## Geodetic CRS:  WGS 84
base_map <-
  ggplot() +
    geom_sf(data = map_data, aes(fill = id), colour = NA) +
    scale_discrete_manual(
      # colours for base map layers
      values =
        c(
          "Ice cap"        = "#DAD3FF",
          "Landmass"       = "#FFD23A",
          "Mountain"       = "#FF8D51",
          "Shallow marine" = "#45D8FF"
        ),
      aesthetics = c("fill"),
      name = "Palaeogeography"
    ) +
    theme_bw() +
    theme(panel.border = element_blank())

modern_coastlines <-
  st_read("./data/palaeogeographical_reconstructions/Global_EarthByte_GPlates_PresentDay_Coastlines_Polyline_reconstructed_68.00Ma.gmt")
## Reading layer `Global_EarthByte_GPlates_PresentDay_Coastlines_Polyline_reconstructed_68.00Ma' from data source `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/Global_EarthByte_GPlates_PresentDay_Coastlines_Polyline_reconstructed_68.00Ma.gmt' 
##   using driver `OGR_GMT'
## Simple feature collection with 1945 features and 23 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -87.88832 xmax: 180 ymax: 83.53398
## Geodetic CRS:  WGS 84
base_map +
  geom_sf(data = modern_coastlines, colour = "grey60") +
  coord_sf(crs = "+proj=robin")
Palaeogeographical reconstruction in the latest Cretaceous (68 Ma), with modern coastlines indicated, based on the environmental reconstrution of @Cao2017Bb and the combined rotaion models of @Muller2019T, @Young2019GF, and @Cao2022GR, with correction to the Pacific by @Torsvik2019GGG.

Palaeogeographical reconstruction in the latest Cretaceous (68 Ma), with modern coastlines indicated, based on the environmental reconstrution of W. Cao et al. (2017) and the combined rotaion models of Müller et al. (2019), Young et al. (2019), and X. Cao et al. (2022), with correction to the Pacific by Torsvik et al. (2019).

Rotating modern locality coordinates

We have the modern localities for our bivalve occurrences, but need to rotate these to their position at the end of the Cretaceous (68 Ma) to get their palaeocoordinates. The GPlates Web Service can do this online or through docker, but doesn’t offer the rotation model used for the combined MYC palaeogeographical reconstruction, and can be quite slow to do many requests — and we have tens-of-thousands of bivalve occurrences. Instead, we field this out to pyGPlates. This step of reconstruction has been done, so the file ./data/reconstructed_68Ma.gmt can be read in directly.

ek_occurrences <-
  bivalve_cleaned %>%
  filter(lag >= 66 & lag < 70)

modern_coords <-
  ek_occurrences %>%
  select(c("oid", "tna", "rnk", "eag", "lag", "fml", "lng", "lat", "tna_trimmed")) %>%
  st_as_sf(coords = c("lng", "lat"), crs = 4326)

st_write(modern_coords, "./data/kpg_bivalve_modern_coordinates.gmt", driver = "OGR_GMT")

We have to reduce the number of columns above as OGR_GMT seems only to save the first 68.

Instructions to install pyGPlates can be found here, including the necessary Python packages (specifically proj). Anaconda/Miniconda is a good Python managing system in which you can also create project environments. The code in the python blocks below can be run in a python environment — separate to the main R environment — or can be run directly within R using the reticulate package (Ushey, Allaire, and Tang 2022).

There are two steps to reconstructing the palaeocoordinates: (1) assigning the localities to the static polygons (i.e. plate fragments that move with continental drift), and (2) rotating the plate fragments and localities to the palaeolocations. First we assign the points and output a GPML file that can be visualized in GPlates directly.

import pygplates

# Load one or more rotation files into a rotation model.
rotation_model = pygplates.RotationModel('./data/palaeogeographical_reconstructions/Muller2019-Young2019-Cao2020_CombinedRotations.rot')

# Filename for the static plate boundaries
static_polygons_filename = './data/palaeogeographical_reconstructions/Global_EarthByte_GPlates_PresentDay_StaticPlatePolygons.gpmlz'

# Load some features.
point_features = pygplates.FeatureCollection('./data/kpg_bivalve_modern_coordinates.gmt')

# Output filename for assigned points
assigned_points_output = './data/bivalve_assigned_locations.gpml'

# Place point occurrences onto plate static polygons
assigned_points = pygplates.partition_into_plates(
    static_polygons_filename,
    rotation_model,
    point_features,
    properties_to_copy = [
        pygplates.PartitionProperty.reconstruction_plate_id,
        pygplates.PartitionProperty.valid_time_period
    ]
)

# Write assigned points to a GPML file that can be used in GPlates
assigned_points_collection = pygplates.FeatureCollection(assigned_points)
assigned_points_collection.write(assigned_points_output)

Next we rotate the palaeolocations to 68 Ma and export an OGR-GMT file of these palaeocoordinates.

# Reconstruct features to this geological time.
reconstruction_time = 68

# The filename of the exported reconstructed geometries.
# It's a shapefile called 'reconstructed_68Ma.shp'.
export_filename = './data/reconstructed_{0}Ma.gmt'.format(reconstruction_time)

# Reconstruct the features to the reconstruction time and export them to a shapefile.
pygplates.reconstruct(assigned_points_collection, rotation_model, export_filename, reconstruction_time)

These converted locations can be read back into R and plotted over the base map.

palaeo_coords <-
  st_read("./data/reconstructed_68Ma.gmt")
## Reading layer `reconstructed_68Ma' from data source 
##   `/Users/glbcm/GitHub/kpg_extinction/data/reconstructed_68Ma.gmt' 
##   using driver `OGR_GMT'
## Simple feature collection with 8974 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -179.1851 ymin: -65.69038 xmax: 176.9147 ymax: 79.60964
## Geodetic CRS:  WGS 84
palaeo_coords <-
  palaeo_coords %>%
  add_column(
    plng = st_coordinates(palaeo_coords)[, 1],
    plat = st_coordinates(palaeo_coords)[, 2]
  )

base_map +
  geom_sf(data = palaeo_coords) +
  coord_sf(crs = "+proj=robin")
Palaeolocations of marine bivalve occurrences in the latest Cretaceous (70–66 Ma) plotted on a palaeogeographical map at 68 Ma.

Palaeolocations of marine bivalve occurrences in the latest Cretaceous (70–66 Ma) plotted on a palaeogeographical map at 68 Ma.

Palaeoenvironmental data

We’ve got palaeoenvironmental data from an analysis done by Ridgwell and Schmidt (2010). The data is stored in a netCDF file: this is a compact way to store data on geographical scales where you have values at various locations. It’s different to the mapping data above: this netCDF is raster with data in points, whereas the maps above were vector and had lines of data. Ridgwell and Schmidt (2010) did a reconstruction of the palaeoenvironment at the K/Pg boundary using Biogem from the GeNie package, one of a series of Earth-systems models. There are two steps to getting data from a netCDF file into R: (1) loading the metadata, then (2) pulling the data you want into R. I get some warning about deprecated tidyverse functions, but I don’t think this is anything to worry about (yet). This page has a good guide to getting started with netCDF.

metadata <-
  ncdump::NetCDF("./data/EXAMPLE.p0055c.RidgwellSchmidt2010.SPIN1/biogem/fields_biogem_3d.nc")

# look at the variables
print(metadata$variable, n = Inf)
## # A tibble: 48 × 16
##    name       ndims natts prec  units longname group_index storage shuffle
##    <chr>      <int> <int> <chr> <chr> <chr>          <int>   <dbl> <lgl>  
##  1 year           1     2 float ""    year               1       2 FALSE  
##  2 grid_level     2     5 int   "n/a" grid de…           1       1 FALSE  
##  3 grid_mask      2     4 float "n/a" land-se…           1       1 FALSE  
##  4 grid_topo      2     4 float "m"   ocean d…           1       1 FALSE  
##  5 ocn_temp       4     4 float "deg… tempera…           1       2 FALSE  
##  6 ocn_sal        4     4 float "PSU" salinity           1       2 FALSE  
##  7 ocn_DIC        4     4 float "mol… dissolv…           1       2 FALSE  
##  8 ocn_DIC_1…     4     4 float "o/o… d13C of…           1       2 FALSE  
##  9 ocn_PO4        4     4 float "mol… dissolv…           1       2 FALSE  
## 10 ocn_O2         4     4 float "mol… dissolv…           1       2 FALSE  
## 11 ocn_ALK        4     4 float "mol… alkalin…           1       2 FALSE  
## 12 ocn_DOM_C      4     4 float "mol… dissolv…           1       2 FALSE  
## 13 ocn_DOM_C…     4     4 float "o/o… d13C of…           1       2 FALSE  
## 14 ocn_DOM_P      4     4 float "mol… dissolv…           1       2 FALSE  
## 15 ocn_Ca         4     4 float "mol… dissolv…           1       2 FALSE  
## 16 ocn_SO4        4     4 float "mol… dissolv…           1       2 FALSE  
## 17 ocn_H2S        4     4 float "mol… dissolv…           1       2 FALSE  
## 18 ocn_Mg         4     4 float "mol… dissolv…           1       2 FALSE  
## 19 carb_H         4     2 float ""    carbona…           1       2 FALSE  
## 20 carb_fug_…     4     2 float ""    carbona…           1       2 FALSE  
## 21 carb_conc…     4     2 float ""    carbona…           1       2 FALSE  
## 22 carb_conc…     4     2 float ""    carbona…           1       2 FALSE  
## 23 carb_conc…     4     2 float ""    carbona…           1       2 FALSE  
## 24 carb_ohm_…     4     2 float ""    carbona…           1       2 FALSE  
## 25 carb_ohm_…     4     2 float ""    carbona…           1       2 FALSE  
## 26 carb_dCO3…     4     2 float ""    carbona…           1       2 FALSE  
## 27 carb_dCO3…     4     2 float ""    carbona…           1       2 FALSE  
## 28 carb_RF0       4     2 float ""    carbona…           1       2 FALSE  
## 29 diag_NH4_…     4     3 float "mol… product…           1       2 FALSE  
## 30 diag_NH4_…     4     3 float "mol… product…           1       2 FALSE  
## 31 diag_NO3_…     4     3 float "mol… product…           1       2 FALSE  
## 32 diag_NO3_…     4     3 float "mol… product…           1       2 FALSE  
## 33 diag_SO4_…     4     3 float "mol… product…           1       2 FALSE  
## 34 diag_SO4_…     4     3 float "mol… product…           1       2 FALSE  
## 35 diag_dCH4      4     3 float "mol… product…           1       2 FALSE  
## 36 phys_u         4     4 float "m/s" ocean v…           1       2 FALSE  
## 37 phys_v         4     4 float "m/s" ocean v…           1       2 FALSE  
## 38 phys_w         4     4 float "m/s" ocean v…           1       2 FALSE  
## 39 misc_pH        4     3 float "pH … ocean pH           1       2 FALSE  
## 40 misc_rCaC…     4     3 float "n/a" CaCO3 t…           1       2 FALSE  
## 41 misc_frac…     4     3 float "n/a" POC fra…           1       2 FALSE  
## 42 misc_frac…     4     3 float "n/a" CaCO3 f…           1       2 FALSE  
## 43 misc_rho       4     3 float "kg … ocean d…           1       2 FALSE  
## 44 grid_A         4     3 float "m2"  ocean c…           1       2 FALSE  
## 45 grid_dD        4     3 float "m2"  ocean c…           1       2 FALSE  
## 46 carb_d13C…     4     3 float "o/o… d13C of…           1       2 FALSE  
## 47 carb_d13C…     4     3 float "o/o… d13C of…           1       2 FALSE  
## 48 carb_d13C…     4     3 float "o/o… d13C of…           1       2 FALSE  
## # … with 7 more variables: compression <lgl>, unlim <lgl>,
## #   make_missing_value <lgl>, missval <dbl>, hasAddOffset <lgl>,
## #   hasScaleFact <lgl>, id <dbl>
# load the desired variables
kpg_env <-
  nc_open("./data/EXAMPLE.p0055c.RidgwellSchmidt2010.SPIN1/biogem/fields_biogem_3d.nc")

Loading the data gives us lots of lovely information. The important thing to note is that all the environment variables have values in fours components: longitude, latitude, depth, and time. The last of these is because GeNie runs for a set number of simulated years to reach a steady state. In this case it’s for 10,000 years, so we’ll want to use the values from year 10,000.

lon <-
  ncvar_get(kpg_env, "lon")
lat <-
  ncvar_get(kpg_env, "lat")
temp <-
  ncvar_get(kpg_env, "ocn_temp")
dim(temp)
## [1] 36 36 16 13

The dimensions of the temperature values are 36, 36, 16, and 13. It’s a four-dimensional array with 36 values each for longitude and latitude, 16 depth values, and 13 time values. The ordering means that rows are longitude, columns are latitude, and depth and time are printed as sequential matrices. We can plot this quickly.

image(lon, lat, temp[,,1,13], asp = 1)

But his isn’t too useful and needs some cleaning up.

miss_val <- ncatt_get(kpg_env, "ocn_temp", "missing_value")
temp[temp == miss_val$value] <- NA
surface_temp <- temp[,,1,13]

grid = expand.grid(lon = lon, lat = lat)
grid$ocn_temp <- c(surface_temp)

temp_plot <-
  st_as_stars(grid, dims = c("lon", "lat"), values = "ocn_temp", crs = 4326)

ggplot() +
  geom_stars(data = temp_plot) + #,  aes(x = lon, y = lat, fill = ocn_temp)) +
  geom_sf(data = modern_coastlines, colour = "grey60") +
  scale_fill_viridis(option = "inferno") +
  theme_void()

You can see that the coastlines and temperatures don’t quite match up. The projections must be different. We can either reproject the coastlines onto the 2010 data or find some more recent temperature data using this outlines.

Pulling out values

The last step here is to pull out the environmental values for each of the occurrences.

occ_coord <-
  palaeo_coords %>%
  as_data_frame() %>%
  select(c("plng", "plat")) %>%
  SpatialPoints()
temp_coord <- SpatialPoints(grid[complete.cases(grid),])

# from https://stackoverflow.com/a/27444208
nearest_points <-
  apply(gDistance(temp_coord, occ_coord, byid = TRUE), 1, which.min)

The code above identifies the row number in the expanded temperature data that corresponds to the location for each occurrence. We can then extract and add that as a new column to palaeo_coords.

occ_temp <-
  grid[complete.cases(grid),]$ocn_temp[nearest_points]

palaeo_coords <-
  palaeo_coords %>%
  add_column(pocn_temp = occ_temp)

Finding extinctions

We find taxa that go extinct by charting those present before the K/Pg boundary that aren’t present afterwards. Initially, do this for each species: we group into families later. Filtering out the species that go extinct, we choose those that have their last appearances after 70 Ma but before 66 Ma; you can be more precise and do after 68 Ma, but I don’t know whether the PBDB is dated precise enough to make much difference here.

species_extinctions <-
  bivalve_cleaned %>%
    filter(
      rnk == 3,
      lag >= 66 & lag <= 70
    ) %>%
    group_by(tna_trimmed) %>%
    summarize(kpg_ext = 1)

In the context of families, we can see the proportions of the families that go extinct. We choose species from the Maastrichtian and Dining, join on the previous table identifying those species that go extinct, then group into families and count the proportion that go extinction at the K/Pg boundary.

fml_extinction_proportion <-
  bivalve_cleaned %>%
  filter(
    rnk == 3,
    lag >= 61.2 & lag <= 72 & eag >= 66
  ) %>%
  left_join(species_extinctions, by = "tna_trimmed") %>%
  select(tna_trimmed, fml, kpg_ext) %>%
  distinct(.keep_all = TRUE) %>%
  mutate(kpg_ext = replace(kpg_ext, is.na(kpg_ext), 0)) %>%
  group_by(fml, kpg_ext) %>%
  summarize(n = n()) %>%
  mutate(ext_prop = n / sum(n)) %>%
  filter(kpg_ext == 1) # select the extinction proportion
## `summarise()` has grouped output by 'fml'. You can override using the
## `.groups` argument.
fml_extinction_proportion %>%
  ggplot(aes(x = fml, y = ext_prop)) +
  geom_col() +
  coord_flip()
Proportions of bivalve families that go extinction across the Cretaceous/palaeogene boundary.

Proportions of bivalve families that go extinction across the Cretaceous/palaeogene boundary.

This last plot may also benefit from showing the number of species for each family (to gauge the relative magnitude) and another version showing the absolute numbers.

Starting modelling

These last chunks essentially bring together most of the things we need to start modelling. These first few bits below will start with showing the general workflow for modelling extinctions across the Cretaceous–Palaeogene boundary and can then be readily extended to ask other or more specific questions. There are two levels to look at when modelling the extinction of these Bivalve taxa: (1) the family level, and (2) the species level.

Family extinctions
Using the proportion of the family that goes extinct as the variable to predict.
Species extinctions
Predicting the extinction of species as an absolute: whether it does or does not go extinct.

You’ll be familiar with linear modelling, typified by the equation \(y = mx + c\). In our case, \(y\) is extinction — the value we’re trying to predict — while \(x\) is the predictor that we think has an effect, or drives extinction — temperature, area, salinity etc. In other language, extinction is the dependent variable while temperature, area, salinity are the independent variables. Additional complexity comes in because standard linear modelling assumes that the dependent variable is normally-distributed. What \(y = mx + c\) is shorthand for is this:

\[ y \sim \textrm{Normal} \left( \mu, \sigma \right) \\ \mu = mx + c, \]

where \(\mu\) is the mean and \(\sigma\) is a constant standard deviation. You’re predictors, \(x\) are use to work out the mean of a normal distribution, \(\mu\), which has a fixed standard deviation, \(\sigma\). Going through the normal distribution is where the ‘randomness’ inherent in the data comes from and accounts for the spread of the data about the best-fit line. Using this normal distribution model is fine for heights or lengths or other things like that, but we have two different examples:

  1. Family extinctions is a proportion, so must between 0–1. You could force a very narrow normal distribution with a small standard deviation, but the maths extends to infinity no matter how unlikely the probably may be.
  2. Species extinctions will be a yes/no on whether that species goes extinct (included as 1 or 0). The normal distribution is continuous, so cannot easily be applied to discrete values like this.

This is where the extension of generalized linear modelling comes in: this relaxes the requirement to use a normal distribution for the dependent variable, so now we can predict proportions or discrete values.

Family extinctions

Family extinctions we’ve calculated as the proportion of the family that goes extinction (between 0–1). Therefore, to model this we need to use a distribution that extends between 0–1, but not beyond that. This is prime territory for the beta distribution, so our model instead looks like this:

\[ E_i \sim \textrm{Beta} \left( \bar{p}_i, \theta \right) \\ \textrm{logit} \left( \bar{p}_i \right) = m_x x_i + c \\ m_x \sim \textrm{Normal} \left( 0, 10 \right) \\ c \sim \textrm{Normal} \left( 0, 10 \right) \\ \theta = \phi + 2 \\ \phi \sim \textrm{Normal} \left( 0, 10 \right), \]

where \(E\) is the proportion of the family that goes extinct, \(\bar{p}\) is the Beta distribution mean and \(\theta\) its shape parameter, \(x\) is the predictor, \(m\) and \(c\) are the coefficient and intercept of the linear model. For the Beta distribution, \(\theta\) determines the shape: whether the probabilities are condensed (e.g. 0.4–0.6) into the middle or focused around the edges (e.g. 0 or 1) with \(\theta = 2\) giving a flat distribution. Here we add the extra variable \(\phi\) so that \(\theta\) is centred around 2. The logit expression is needed to convert the ingoing model into a value between 0–1 as is required for \(\bar{p}\).

The additional feature here is that this is a more Bayesian realization of generalized linear modelling: the parameters in the model (\(\bar{p}, m_x, c, and \phi\)) all have probability distributions associated with them. In essence, rather that heading towards definite values, this model and these distributions incorporate uncertainty in measurements and models so that we end up with probabilities for a variety of different results, with the ‘best’ linear model having the highest probability. The model can’t tell you the exact level of extinction that a certain driver might generate (like with a very simplified linear model might suggest), but will give you the probability of different amounts of extinction, with one level hopefully being the most probable.

Most of the modelling can be done in R, but often is ‘slow’: adding complicated probability distributions is difficult, especially as you add more data and more complex models. This is added to R repeatedly having to keep checking between the data, working out what calculations it needs to do to get to the model values, then tell the computer to do those things. Instead we’ll use the package brms that uses an external language called Stan. Stan is designed specifically for generalized linear modelling and is also compiled: this means that it starts by taking the model and making a little program that can then skip past all the copying and checking R does and goes straight the computer and calculates. Fast.

The other feature of Stan, and one that’s often used when Bayesian inference is involved, is rather than sampling and calculating the probabilities directly, which is difficult, is instead to push some values through and calculate a result, then shift the parameters and calculate it again. By doing this repeatedly, you can then build up a picture of what the final distribution looks like without having to calculate it in full, complicated detail. This is called Monte Carlo sampling (MC) and often has some extra details called Metropolis Coupled Monte Carlo (MCMC).

brms is an R package that creates the Stan code from the data and model and tells Stan to run an MCMC analysis then parses the results ready to be plotted. It installs all the bits needed, so we don’t need to leave R to do these analyses.

Spread of occurrences

This first model will use the family proportions of extinction and will model this related to the amount of geographical space that the family encompasses. This space will be calculated using a minimum spanning tree (MST). An MST plots the shortest branching path between all the points included in a set of data: the more spread out the points, the longer the MST length. MSTDist in the chunk below calculates the MST for each family at the K/Pg boundary, and the results are joined to the proportions calculated earlier.

family_mst_data <-
  palaeo_coords %>%
  st_drop_geometry() %>%
  group_by(fml) %>%
  summarize(
    n_occs = n(),
    mst = MSTDist(plng, plat)$MST_km
  ) %>%
  select(fml, n_occs, mst) %>%
  left_join(fml_extinction_proportion, by = "fml")

family_mst_data %>%
  ggplot(aes(x = fml, y = mst)) +
  geom_col() +
  coord_flip()

family_mst_data %>%
  ggplot(aes(x = mst, y = ext_prop)) +
  geom_point()

This last plot is a good one to check as it’s the linear model we want to test: extinction against MST length. There’s hints of a positive relationship but the data are certainly well spread.

A few rows have NA for their extinction values; these are the families that don’t appear until after the K/Pg boundary. We’ll remove these in a moment, but we’re nearly ready to go in into modelling. The last bit will be to standardize the MST values: change all the values so that they have a mean 0 and standard deviation of 1. This reduces the effects of scale when doing the modelling; we will convert back afterwards when we need to.

mst_scaling <-
  scale(family_mst_data$mst)

family_mst_scaled <-
  family_mst_data %>%
  mutate(mst_stand = mst_scaling[, 1]) %>%
  filter(!is.na(ext_prop)) %>%
  select(fml, mst_stand, ext_prop)

Now we generate the linear model. This takes the form above, but I’ll replace a few of the parameter symbols to match that we’re using MST:

\[ E_i \sim \textrm{ZOIBeta} \left( \bar{p}_i, \phi \right) \\ \textrm{logit} \left( \bar{p}_i \right) = \alpha + \beta_M M_i \\ \beta_M \sim \textrm{Normal} \left(0, 10 \right) \\ \alpha \sim \textrm{Normal} \left( 0, 10 \right) \\ \phi \sim \textrm{Gamma} \left( 0.01, 0.01 \right), \]

where \(E_i\) is the proportion of the family that goes extinct, \(M_i\) is the scaled minimum spanning tree for the whole family, \(\alpha\) is the intercept, and \(\beta_M\) is the coefficient for \(M_i\) in the linear model. Here we use a Zero-One-Inflated Beta distribution (ZOIBeta) as the data has many proportions of 1.0 for our extinctions; these are likely a combination of total extinction and small numbers of species.

We now implement this in brms. This borrows the formula notation used in base R linear models, but adds in the extras needed for generalized linear modelling. The first thing that’s useful to do is to show what parameters need to be defined initially with get_priors.

get_prior(
  ext_prop ~ 0 + mst_stand,
  data = family_mst_scaled,
  family = zero_one_inflated_beta()
)
##              prior class      coef group resp dpar nlpar lb ub
##             (flat)     b                                      
##             (flat)     b mst_stand                            
##         beta(1, 1)   coi                                  0  1
##  gamma(0.01, 0.01)   phi                                  0   
##         beta(1, 1)   zoi                                  0  1
##        source
##       default
##  (vectorized)
##       default
##       default
##       default

Here we see how the model is defined with ext_prop ~ mst_stand, i.e. the extinction proportion is related to the standardized MST length. Then we specify the data and the family or distribution to be used in the model, in this case the Beta distribution. This also includes the logit function needed. You can see all the distributions and several useful examples for brms in the vignettes and its website. A list of the available families is here or in R help (?brmsfamily).

brms is clever in setting out all the bits of the model we need and providing sensible defaults. We can change these prior distributions (i.e. those we start the model with) in the main function using prior as you can see below. Running the model itself uses the same form in the function brm: call the formula, data, and family, then set any priors. Below we also setup brm to sample from the prior, so we can see the before and after, and then some housekeeping on how many jumps we want the MCMC analysis to do. More jumps samples better but takes more time, and the algorithm is clever so that it learns as it goes; thus after a while adding more iterations (iter) doesn’t change the results and are wasted. Getting there quicker is helped by having more chains, which are essentially repeats of the analysis that can be combined, and using more cores on your computer so that you do several of these repeated analyses at the same time.

Run this code block and then we’ll see what the output looks like.

model_family_mst <-
  brm(
    ext_prop ~ mst_stand,
    data = family_mst_scaled,
    family = zero_one_inflated_beta(),
    prior = prior(normal(0, 10), coef = "mst_stand", class = "b") +
      prior(normal(0, 10), class = "Intercept"),
    sample_prior = TRUE,
    iter = 4000, chains = 4, cores = 4,
    save_pars = save_pars()
  )
## Compiling Stan program...
## Start sampling

Once the brm function is run, the Stan program is compiled and then the analysis proceeds. If you’re running the same analysis again then it shouldn’t require recompiling and goes straight to the analysis.

summary(model_family_mst)
##  Family: zero_one_inflated_beta 
##   Links: mu = logit; phi = identity; zoi = identity; coi = identity 
## Formula: ext_prop ~ mst_stand 
##    Data: family_mst_scaled (Number of observations: 32) 
##   Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
##          total post-warmup draws = 8000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.57      0.17     0.23     0.90 1.00     8459     5488
## mst_stand     0.31      0.18    -0.04     0.66 1.00     9120     5610
## 
## Family Specific Parameters: 
##     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi    10.08      3.17     4.90    17.20 1.00     8970     5813
## zoi     0.38      0.08     0.23     0.55 1.00    10927     6026
## coi     0.93      0.07     0.76     1.00 1.00     7157     3644
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(model_family_mst)

pp_check(model_family_mst, ndraws = 100)

conditional_effects(model_family_mst)

The last line, conditional_effects plots the relationship between the input and output, i.e. the ‘line of best fit’ between minimum spanning tree and family extinction proportion. This will probably the most interesting and useful. You can use the argument plot = FALSE to save the values separately.

Species extinctions

Species extinctions are instead measured as a no/yes, coded as 0/1, so we instead use the binomial distribution generalized linear model. The binomial distribution shows the probability of success or failure for a given number of attempts (think coin tosses and getting heads or tails). In our case, the number of attempts is 1: a species can only go extinct once! (This makes it a special case of the binomial distribution called the Bernoulli distribution.) The parameter \(p\) then becomes the probability that the species will go extinct; this is what we’re trying to predict.

\[ E \sim \textrm{Bernoulli} \left( p_i \right) \\ \textrm{logit} \left( p_i \right) = \alpha + \beta_T T_i \\ \alpha \sim \textrm{Normal} \left( 0, 10 \right) \\ \beta_T \sim \textrm{Normal} \left( 0, 10 \right), \]

where \(E\) is whether a species goes extinct or not, \(p\) is the probability of extinction, \(T\) is the average temperature for that species, and \(\alpha\) and \(\beta\) are the coefficients.

In fact, we can go a step further and use all the occurrences rather than grouping by species and using the temperature at each occurrence as the input. This requires a multilevel or hierarchical model. Rather than grouping the family data (and summarizing with a mean value or similar), we instead input the individual occurrences and model each one and the trend across whole species. The model is the same except for the second line:

\[ \textrm{logit} \left( p_i \right) 1= \alpha + \beta_{T[species]} T_i, \]

where \(\beta_{T[species]}\) now indicates the coefficient on a per-species basis. We create our data.

species_temp_scaled <-
  palaeo_coords %>%
  st_drop_geometry() %>%
  filter(rnk == 3) %>%
  left_join(species_extinctions, by = "tna_trimmed") %>%
  mutate(
    tna_trimmed = as_factor(tna_trimmed),
    pocn_temp_scaled = scale(pocn_temp)[, 1]
  ) %>%
  select(tna_trimmed, pocn_temp_scaled, kpg_ext, fml, plng, plat)

Implementing this in brms is similarly not too difficult to do either:

get_prior(
  kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed),
  data = species_temp_scaled,
  family = binomial()
)
##                 prior     class             coef       group resp dpar
##                lkj(1)       cor                                       
##                lkj(1)       cor                  tna_trimmed          
##  student_t(3, 0, 2.5) Intercept                                       
##  student_t(3, 0, 2.5)        sd                                       
##  student_t(3, 0, 2.5)        sd                  tna_trimmed          
##  student_t(3, 0, 2.5)        sd        Intercept tna_trimmed          
##  student_t(3, 0, 2.5)        sd pocn_temp_scaled tna_trimmed          
##  nlpar lb ub       source
##                   default
##              (vectorized)
##                   default
##         0         default
##         0    (vectorized)
##         0    (vectorized)
##         0    (vectorized)

The syntax (pocn_temp_scaled | tna_trimmed) estimates the temperature coefficient for each species grouping. This creates a correlation matrix with the model (lkj(1) in the priors). We then run the model itself. This will be slower than the family-level analysis because it is estimating much more from more data.

model_species_temp <-
  brm(
    # kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed),
    kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml),
    data = species_temp_scaled,
    family = bernoulli(),
    sample_prior = TRUE,
    iter = 4000, chains = 4, cores = 4,
    save_pars = save_pars()
  )
## Compiling Stan program...
## Start sampling

And we can summarize with these.

summary(model_species_temp)
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml) 
##    Data: species_temp_scaled (Number of observations: 3354) 
##   Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
##          total post-warmup draws = 8000
## 
## Group-Level Effects: 
## ~fml (Number of levels: 31) 
##                                 Estimate Est.Error l-95% CI u-95% CI Rhat
## sd(Intercept)                       1.89      1.78     0.07     6.41 1.00
## sd(pocn_temp_scaled)                1.19      1.15     0.04     4.15 1.00
## cor(Intercept,pocn_temp_scaled)     0.02      0.58    -0.95     0.95 1.00
##                                 Bulk_ESS Tail_ESS
## sd(Intercept)                       5459     3633
## sd(pocn_temp_scaled)                4887     3807
## cor(Intercept,pocn_temp_scaled)     6788     3901
## 
## ~tna_trimmed (Number of levels: 200) 
##                                 Estimate Est.Error l-95% CI u-95% CI Rhat
## sd(Intercept)                       1.78      1.67     0.06     6.00 1.00
## sd(pocn_temp_scaled)                1.06      0.99     0.03     3.60 1.00
## cor(Intercept,pocn_temp_scaled)     0.02      0.58    -0.95     0.94 1.00
##                                 Bulk_ESS Tail_ESS
## sd(Intercept)                       6404     3929
## sd(pocn_temp_scaled)                5430     3196
## cor(Intercept,pocn_temp_scaled)     9980     4749
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    25.94     21.24     9.63    74.96 1.00     1482      816
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(model_species_temp)

pp_check(model_species_temp, ndraws = 100)

It may not appear different to the family run above, but noticed under the summary that there are 200 levels in the group-level effects: these are the coefficients for each species.

It may also be worth grouping by family too with kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed | fml) or kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml). Have a look at this this document and ?brmsformula for a description of the formula syntax that brms uses.

More advanced plotting

The summary plots above give some good starting formation, but more detailed plotting is needed to fully understand the output. This is where the bayesplot package comes in: it holds many useful functions for easily plotting the output of brms and other Stan outputs. Have a look at this link for more examples.

Posterior intervals

Posterior intervals are probably the most useful outputs. These show the range of the output coefficients and their mean values. In other words, these are the outputs we’d want to see to tell whether there is a strong relationship between out predictor and response. The relative size can also tell whether certain species might be more strongly affected. In each case1, we first pull out the results, or draws from the model object, then use a function from bayesplot to plot the results.

posterior <- as_draws_array(model_species_temp)

mcmc_intervals(posterior, pars = vars(!matches("prior|lp__|b_|sd_|cor_")))

ggsave(file = "./fig/model-species-family-intervals.pdf", height = 80, limitsize = FALSE)
## Saving 7 x 80 in image

The parameters are estimated for all 200 or so species, hence the not very useful plot. Also, because some of the values are large, the coefficients are compressed. The desired parameters can be specified with the pars argument; parnames gives a list of the parameters to choose from.

#parnames(model_species_temp)
mcmc_areas(
  posterior,
  pars = vars(contains("arca"), starts_with("b_Intercept"))
)

Adding in correlation

Some amount of correlation will come because the occurrences are all next to each other and the temperatures vary across that. We can add an extra term gp to the formula to describe this autocorrelation based on the palaeocoordinates. This describes a Gaussian process (and a multivariate one at that), which builds in the autocorrelation between the occurrence positions. The implementation below is inspired by this one.

In this case the Gaussian process we want works on the distances between each occurrence: closer occurrences have more similar values with the model (often on the standard deviation). Thus, we give the Gaussian process the palaeocoordinates; brms then generates the correlation structure that Stan uses in the model and MCMC.

Warning, this will be slow! (Multiple hours to complete.) I suggest playing with the models above then thinking about using this one later. When using multiple chains, the time taken for each chain can also be very different as the starting points are randomized.

model_species_temp_autocorr <-
  brm(
    kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml) + gp(plng, plat, scale = FALSE),
    data = species_temp_scaled,
    family = bernoulli(),
    sample_prior = TRUE,
    iter = 100, chains = 1, cores = 1,
    save_pars = save_pars()
  )

Another source of covariation is phylogeny, which I haven’t done yet but can include later.

References

Arel-Bundock, Vincent, Nils Enevoldsen, and CJ Yetman. 2018. “Countrycode: An R Package to Convert Country Names and Country Codes.” Journal of Open Source Software 3 (28): 848. https://doi.org/10.21105/joss.00848.
Cao, Wenchao, Sabin Zahirovic, Nicolas Flament, Simon Williams, Jan Golonka, and R. Dietmar Müller. 2017. “Improving Global Paleogeography Since the Late Paleozoic Using Paleobiology.” Biogeosciences 14 (23): 5425–39. https://doi.org/10.5194/bg-14-5425-2017.
Cao, Xianzhi, Sabin Zahirovic, Sanzhong Li, Yanhui Suo, Pengcheng Wang, Jinping Liu, and R. Dietmar Müller. 2022. “A Deforming Plate Tectonic Model of the South China Block Since the Jurassic.” Gondwana Research, Tectonic Evolution of Ocean-Continent Connection Zones, 102 (February): 3–16. https://doi.org/10.1016/j.gr.2020.11.010.
Garnier, Simon, Ross, Noam, Rudis, Robert, Camargo, et al. 2021. viridis - Colorblind-Friendly Color Maps for r (version 0.6.2). https://doi.org/10.5281/zenodo.4679424.
Müller, R. Dietmar, Sabin Zahirovic, Simon E. Williams, John Cannon, Maria Seton, Dan J. Bower, Michael G. Tetley, et al. 2019. “A Global Plate Model Including Lithospheric Deformation Along Major Rifts and Orogens Since the Triassic.” Tectonics 38 (6): 1884–1907. https://doi.org/10.1029/2018TC005462.
Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Betwe En JSON Data and r Objects.” arXiv. https://doi.org/10.48550/arXiv.1403.2805.
Pebesma, Edzer. 2018. “Simple Features for r: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
Ridgwell, Andy, and Daniela N. Schmidt. 2010. “Past Constraints on the Vulnerability of Marine Calcifiers to Massive Carbon Dioxide Release.” Nature Geoscience 3 (3, 3): 196–200. https://doi.org/10.1038/ngeo755.
South, Andy. 2017a. Rnaturalearth: World Map Data from Natural Earth (version 0.1.0). https://CRAN.R-project.org/package=rnaturalearth.
———. 2017b. Rnaturalearthdata: World Vector Map Data from Natural Earth Used in ’Rnaturalearth’. Manual. https://CRAN.R-project.org/package=rnaturalearthdata.
Torsvik, Trond H., Bernhard Steinberger, Grace E. Shephard, Pavel V. Doubrovine, Carmen Gaina, Mathew Domeier, Clinton P. Conrad, and William W. Sager. 2019. “Pacific-Panthalassic Reconstructions: Overview, Errata and the Way Forward.” Geochemistry, Geophysics, Geosystems 20 (7): 3659–89. https://doi.org/10.1029/2019GC008402.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2022. Reticulate: Interface to ’Python (version 1.24). https://CRAN.R-project.org/package=reticulate.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Young, Alexander, Nicolas Flament, Kayla Maloney, Simon Williams, Kara Matthews, Sabin Zahirovic, and R. Dietmar Müller. 2019. “Global Kinematics of Tectonic Plates and Subduction Zones Since the Late Paleozoic Era.” Geoscience Frontiers, Special Issue: Advances in Himalayan Tectonics, 10 (3): 989–1013. https://doi.org/10.1016/j.gsf.2018.05.011.
Zizka, Alexander, Daniele Silvestro, Tobias Andermann, Josue Azevedo, Camila Duarte Ritter, Daniel Edler, Harith Farooq, et al. 2019. CoordinateCleaner: Standardized Cleaning of Occurrence Records from Biological Collection Databases.” Methods in Ecology and Evolution, no. 10: –7. https://doi.org/10.1111/2041-210X.13152.